Skip to content

refactor(context): deduplicate register/read option-building logic#1479

Open
mesejo wants to merge 1 commit intoapache:mainfrom
mesejo:refactor/context-register-read
Open

refactor(context): deduplicate register/read option-building logic#1479
mesejo wants to merge 1 commit intoapache:mainfrom
mesejo:refactor/context-register-read

Conversation

@mesejo
Copy link
Copy Markdown
Contributor

@mesejo mesejo commented Apr 5, 2026

Which issue does this PR close?

N/A

Rationale for this change

Reduce the number of execution paths for register and read functions

What changes are included in this PR?

Extract shared helpers (convert_partition_cols, convert_file_sort_order, build_parquet/json/avro_options, convert_csv_options), standardize path types to &str, and remove redundant intermediate variables.

Are there any user-facing changes?

No

Extract shared helpers (convert_partition_cols, convert_file_sort_order,
build_parquet/json/avro_options, convert_csv_options), standardize path
types to &str, and remove redundant intermediate variables.
@mesejo mesejo force-pushed the refactor/context-register-read branch from fe84781 to 39e190f Compare April 5, 2026 16:17
@mesejo mesejo marked this pull request as ready for review April 5, 2026 16:38
@timsaucer timsaucer requested a review from Copilot April 8, 2026 17:41
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors PySessionContext read/register APIs to reduce duplicated option-building logic across Parquet/CSV/JSON/Avro operations.

Changes:

  • Extracted shared helper functions to convert partition columns / sort order and to build Parquet/JSON/Avro/Csv read options.
  • Simplified read/register implementations by removing intermediate variables and consolidating async wait patterns.
  • Standardized some path parameters from PathBuf to &str for JSON/Avro APIs.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 868 to 872
pub fn register_json(
&self,
name: &str,
path: PathBuf,
path: &str,
schema: Option<PyArrowType<Schema>>,
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing path from PathBuf to &str in this #[pymethods] API is a breaking change for Python callers that previously passed pathlib.Path / other os.PathLike objects (PyO3 can extract those into PathBuf, but not into &str). Consider keeping PathBuf (and moving the to_str() conversion into a helper), or accept a Bound<'_, PyAny> and explicitly support both str and os.PathLike inputs to preserve the existing Python surface area.

Copilot uses AI. Check for mistakes.
Comment on lines 1004 to 1009
#[allow(clippy::too_many_arguments)]
#[pyo3(signature = (path, schema=None, schema_infer_max_records=1000, file_extension=".json", table_partition_cols=vec![], file_compression_type=None))]
pub fn read_json(
&self,
path: PathBuf,
path: &str,
schema: Option<PyArrowType<Schema>>,
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read_json now takes path: &str instead of PathBuf, which likely removes support for pathlib.Path / os.PathLike inputs in the Python API. If this is intended, it should be treated as user-facing and called out; otherwise keep the previous PathBuf signature (or accept a generic Python object and convert via os.fspath).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member

@timsaucer timsaucer Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this one is a valid concern, including the other two places copilot found

Comment on lines 890 to 900
#[allow(clippy::too_many_arguments)]
#[pyo3(signature = (name,
path,
schema=None,
file_extension=".avro",
table_partition_cols=vec![]))]
pub fn register_avro(
&self,
name: &str,
path: PathBuf,
path: &str,
schema: Option<PyArrowType<Schema>>,
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as JSON: changing register_avro's path parameter from PathBuf to &str breaks Python callers that pass pathlib.Path/os.PathLike. Prefer keeping PathBuf (or explicitly supporting os.fspath conversion) to avoid an API regression.

Copilot uses AI. Check for mistakes.
Comment on lines 868 to 878
pub fn register_json(
&self,
name: &str,
path: PathBuf,
path: &str,
schema: Option<PyArrowType<Schema>>,
schema_infer_max_records: usize,
file_extension: &str,
table_partition_cols: Vec<(String, PyArrowType<DataType>)>,
file_compression_type: Option<String>,
py: Python,
) -> PyDataFusionResult<()> {
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description says "No user-facing changes", but the switch from PathBuf to &str for JSON/Avro read/register methods is user-facing for Python (it changes which argument types are accepted, e.g., pathlib.Path). Either restore the previous behavior or update the PR description / changelog notes accordingly.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member

@timsaucer timsaucer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind merging main and giving register_arrow and read_arrow the similar treatment?

@timsaucer
Copy link
Copy Markdown
Member

Nice cleanup!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants